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Gaussian Mixture Reduction Using Reverse 
Kullback-Leibler Divergence 

Tohid Ardeshiri, Umut Orguner, Emre Ozkan 


Abstract —We propose a greedy mixture reduction algorithm 
which is capable of pruning mixture components as well as 
merging them based on the Kullback-Leibler divergence (KLD). 
The algorithm is distinct from the well-known Runnalls’ KLD 
based method since it is not restricted to merging operations. 
The capability of pruning (in addition to merging) gives the 
algorithm the ability of preserving the peaks of the original 
mixture during the reduction. Analytical approximations are 
derived to circumvent the computational intractability of the 
KLD which results in a computationally efficient method. The 
proposed algorithm is compared with Runnalls’ and Williams’ 
methods in two numerical examples, using both simulated and 
real world data. The results indicate that the performance and 
computational complexity of the proposed approach make it an 
efficient aiternative to existing mixture reduction methods. 


I. Introduction 

Mixture densities appear in various problems of the estima¬ 
tion theory. The existing solutions for these problems often 
require efficient strategies to reduce the number of compo¬ 
nents in the mixture representation because of computational 
limits. For example, time series problems can involve an ever 
increasing number of components in a mixture over time. The 
Gaussian sum filter Q'j for nonlinear state estimation; Multi- 
Hypotheses Tracking (MHT) E| and Gaussian Mixture Prob¬ 
ability Hypothesis Density (GM-PHD) filter |3j for multiple 
target tracking can be listed as examples of the algorithms 
which require mixture reduction (abbreviated as MR in the 
sequel) in their implementation. 

Several methods were proposed in the literature addressing 
the MR problem. In Q and (5|, the components of the mixture 
were successively merged in pairs for minimizing a cost func¬ 
tion. A Gaussian MR algorithm using homotopy to avoid local 
minima was suggested in [§}. Merging statistics for greedy MR 
for multiple target tracking was discussed in |7j. A Gaussian 
MR algorithm using clustering techniques was proposed in |8j. 
In 0, Crouse et al. presented a survey of the Gaussian MR 
algorithms such as West’s algorithm I'OJ- constraint optimized 
weight adaptation 0 Runnalls’ algorithm 0 and Gaussian 
mixture reduction via clustering (8) and compared them in 
detail. 

Williams and Maybeck © proposed using the Integral 
Square Error (ISE) approach for MR in the multiple hypothesis 
tracking context. One distinctive feature of the method is 
the availability of exact analytical expressions for evaluating 
the cost function between two Gaussian mixtures. Whereas, 
Runnalls proposed using an upper bound on the Kullback- 
Leibler divergence (KLD) as a distance measure between the 
original mixture density and its reduced form at each step of 
the reduction in (12) . The motivation for the choice of an upper 
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bound in Runnalls’ algorithm is based on the premise that the 
KLD between two Gaussian mixtures can not be calculated 
analytically. Runnalls’ approach is dedicated to minimize the 
KLD from the original mixture to the approximate one, which 
we refer to here as the Forward-KLD (LKLD). This choice 
of the cost function results in an algorithm, which reduces 
the number of components only by merging them at each 
reduction step. 

In this paper, we propose a KLD based MR algorithm. Our 
aim is to find an efficient method for minimizing the KLD 
from the approximate mixture to the original one, which we 
refer to as the Reverse-KLD (RKLD). The resulting algorithm 
has the ability to choose between pruning or merging compo¬ 
nents at each step of the reduction unlike Runnalls’ algorithm. 
This enables, for example, the possibility to prune low-weight 
components while keeping the heavy-weight components un¬ 
altered. Lurthermore, we present approximations which are 
required to overcome the analytical intractability of the RKLD 
between Gaussian mixtures making the implementation fast 
and efficient. 

The rest of this paper is organized as follows. In Section [H] 
we present the necessary background required for the MR 
problem we intend to solve in this paper. Two of the most rel¬ 
evant works and their strengths and weaknesses are described 
in Section [III] The proposed solution and its strengths are 
presented in Section 1 1V | Approximations for the fast compu¬ 
tation of the proposed divergence are given in Section [V] The 
proposed MR algorithm using the approximations is evaluated 
and compared to the alternatives on two numerical examples 
in Section [Vi] The paper is concluded in Section |VII| 


II. Background 

A mixture density is a convex combination of (more basic) 
probability densities, see e.g. 1141. A normalized mixture with 
N components is defined as 


N 


p ( x ) = ^2u!iq{x-,vi), 


(i) 


1=1 


where the terms wj are positive weights summing up to unity, 
and r\i are the parameters of the component density q(x;rp). 

The mixture reduction problem (MRP) is to find an ap¬ 
proximation of the original mixture using fewer components. 
Ideally, the MRP is formulated as a nonlinear optimization 
problem where a cost function measuring the distance between 
the original and the approximate mixture is minimized. The 
optimization problem is solved by numerical solvers when the 
problem is not analytically tractable. The numerical optimiza¬ 
tion based approaches can be computationally quite expensive, 
in particular for high dimensional data, and they generally 
suffer from the problem of local optima ©, 0, 0. Hence, 
a common alternative solution has been the greedy iterative 
approach. 



2 


In the greedy approach, the number of components in the 
mixture is reduced one at a time. By applying the same pro¬ 
cedure over and over again, a desired number of components 
is reached. To reduce the number of components by one, 
a decision has to be made among two types of operations; 
namely, the pruning and the merging operations. Each of these 
operations are considered to be a hypothesis in a greedy MRP 
and is denoted as Hu, I = 0,... ,N, J = 1,... ,N, I ^ J. 

1) Pruning: Pruning is the simplest operation for reducing 
the number of components in a mixture density. It is denoted 
with the hypothesis Hoj, J = 1 ,...,N in the sequel. In 
pruning, one component of the mixture is removed and the 
weights of the remaining components are rescaled such that 
the mixture integrates to unity. For example, choosing the 
hypothesis Hoj, i.e., pruning component J from <[l), results 
in the reduced mixture 

1 N 

p(x\H 0J ) = — -- V wiq(x\ rn). (2) 

i*J 

2) Merging: The merging operation approximates a pair 
of components in a mixture density with a single component 
of the same type. It is denoted with the hypothesis Hu, 
1 < I ^ J < N in the sequel. In general, an optimization 
problem minimizing a divergence between the normalized 
pair of the mixture and the single component is used for 
this purpose. Choosing the FKLD as the the cost function 
for merging two components, leads to a moment matching 
operation. More specifically, if the hypothesis Hij is selected, 
i.e., if the components / and J are chosen to be merged, the 
parameters of the merged component are found by minimizing 
the divergence from the kernel wjq(x;r/i) + wjq{x\rjj) to a 
single weighted component {wj + wj)q(x',rju) as follows: 

Vij = argmin D KL {wiq{x\r]i) + Wjq(x-,T]j)\\q(x-,r])), 

where, wj = wj/(wj + wj), wj = wj/(wj + wj) and 
Dkl(p\\q) denotes the KLD from p(-) to q(-) which is defined 
as 

D K L(p\\q)= f p{x) log dx. (3) 

J q(x) 

The minimization of the above cost function usually results 
in the single component q(x; 77 jj) whose several moments are 
matched to those of the two component mixture. The reduced 
mixture after merging is then given as 

N 

p(x\Hij) = ^2 w kQ( x: , Vk) + (Wi + Wj)q( x; rju). (4) 

K=1 

K*I 

K^J 


approach involves all of the components of the original mix¬ 
ture, global approaches are in general computationally more 
costly. On the other hand, since the global properties of the 
original mixture are taken into account, they provide better 
performance. In the following, we propose a global greedy 
MR method that can be implemented efficiently. 

III. Related work 

In this section, we give an overview and a discussion for 
two well-known global MR algorithms related to the current 
work. 

A. Runnalls’ Method 

Runnalls’ method | |T2) is a global greedy MR algorithm that 
minimizes the FKFD (i.e., Dkl(p(')\\p{'))- Unfortunately, the 
KFD between two Gaussian mixtures can not be calculated 
analytically. Runnalls uses an analytical upper-bound for the 
KLD which can only be used for comparing merging hypothe¬ 
ses. The upper bound B(I, J ) for Dxl{p{')\\p('\Hij)), which 
is given as 

B(I,J) =w I D KL (q(x;rn)\\q(x;r]u)) ^ 

+ wjD KL (q(x ; t]j)\\q(x; rju)), 

is used as the cost of merging the components I and J where 
<■/(■, r]u) is the merged component density. Hence, the original 
global decision statistics D k l{p{')\\p{-\Hij)) for merging is 
replaced with its local approximation B{I , J) to obtain the 
decision rule as follows: 

«*,j*=arg min B(I,J). (6) 


B. Williams’ Method 

Williams and Maybeck proposed a global greedy MRA 
in G3 where ISE is used as the cost function. ISE between 
two probability distributions p(-) and q(-) is defined by 

AseO||<?) = J \p(x) - q(x)\ 2 dx. (1) 

ISE has all properties of a metric such as symmetry and 
triangle inequality and is analytically tractable for Gaussian 
mixtures. Williams’ method minimizes Dise(p(-)\\p(-\Hij )) 
over all pruning and merging hypotheses, i.e., 

i* > j* = ar g 0 <j<jv Dise{p(')\\p('\Hu)) . (8) 

1 <J<N 

IAJ 


C. Discussion 


There are two different types of greedy approaches in the 
literature, namely, local and global approaches. The local 
approaches consider only the merging hypotheses Hij, 1 < 
/ / ,/ < N. In general, the merging hypothesis Hu which 
provides the smallest divergence Aocai(?(-; 7 ?/)||9(’; Vj)) i s 
selected. This divergence considers only the components to 
be merged and neglects the others. Therefore these methods 
are called local. Well-known examples of local approaches are 
given in fl5|, [:16j. 

In the global approaches, both pruning and merging opera¬ 
tions are considered. The divergence between the original and 
the reduced mixtures, i.e., D g \ 0 -ba\{p{-)\\p{-\Hu)) is minimized 
in the decision. Because the decision criterion for a global 


In this subsection, first we illustrate the behavior of Run¬ 
nalls’ and Wiliams’ methods on a very simple MR example. 
Second, we provide a brief discussion on the characteristics 
observed in the examples along with their implications. 

Example 1. Consider the Gaussian mixture with two compo¬ 
nents given below. 

p(x) = w\N{x] — p, 1) + W2U{x\ p, 1) (9) 

where /i 2 and W \ + w; 2 = 1. We would like to reduce 
the mixture to a single component and hence we consider the 
two pruning hypotheses Hoi , H 02 and the merging hypothesis 
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T~L\ 2 - The reduced mixtures under these hypotheses are given 
as 

v{x\H 0 i) =AT(x-,/j,, 1), (10a) 

p(x\'Hq 2 ) =AT(x; -/x, 1), (10b) 

p{x\Hi2) (10c) 

where JL and E are computed via moment matching as 

At ={w 2 - wi)n, (11a) 

E =1 + 4uqu; 2 /x 2 . (Hb) 

Noting that the optimization problem 

aA P* = argmin D KL (p(-)\\Af(-, p, P)) (12) 

h,p 

has the solution [i* = JL and P* = E, we see that the 
density p(x\7~Li 2 ) is the best reduced mixture with respect to 
FKLD. Similarly, Runnalls’ method would select 'H \ 2 as the 
best mixture reduction hypothesis because it considers only 
the merging hypotheses. ■ 

Example 2. We consider the same MR problem in Example |T| 
with the Williams’ method. The ISE between the original 
mixture and the reduced mixture can be written as 

T , ise(p(-)IIp(’I^/j)) - Jp{x\Hij) 2 dx 

- “2 Jp{x)p{x\Hu) da; (13) 

where the sign = means equality up to an additive constant 
independent of Hu- Letting p{x\Hu) = J\f(x ; pu, E jj), we 
can calculate £ , ise(p(-)IIp(-|'^/j)) usin g <d as 

-Dise(p(-)I IpGI^/j)) /x ij, 2E u) 

- 2u>iAC(a t/j; -/r, 1 + E/j) 

- 2w 2 M{hij\jl, 1 + E/j). (14) 

Hence, for the hypotheses listed in ( fTO) , we have 

Ase(p(OIIp(-|'^oi)) =(1 - 2 w 2 )J\f(p;n, 2 ) 

-2 Wl M(jr,-n,2) (15) 

=(1-21172)^(0; 0,2) 

-2w 1 J\f(p-,-iJ 1 ,2) (16) 

Ase(p(-)IIp(‘I^ 02 )) -(! ~ 2w 1 )M{-n\-n, 2) 

-2w 2 M{-^^2) (17) 

=(1 - 2w 1 )A/'(0; 0,2) 

-2w 2 N{jl-,-jl,2) (18) 

Ase(p(-)IIp(-|^i2)) =AC(/x; AX, 2 E) 

- 2wiN(ji\ —/r, 1 + E) 
-2w 2 AA(At;/x,l + E) (19) 

=AA(0; 0, 2E) 

- 2wiJ\T(2w 2 fj,\ 0,1 + E) 

- 2w 2 N(2 Wl ^ 0,1 + E) (20) 

Restricting ourselves to the case where w\ = w 2 = 1/2 and 


/x is very large, we can see that 
L ) ise(p(-)IIp('I^oi)) ~ -7i= e_/i 

v47T 

Ase(p(-)IIp(-|^02)) » - 4 =^^ (22) 

v47T 

(23) 

For sufficiently large /x values, it is now easily seen that 

L , ise(p(-)IIp(-|Hi 2 )) <Ase(p(-)IIpOI^oi)) 

= -Dise(p(-)IIp(-|'^02))- (24) 

Hence, under the aforementioned conditions, the merging 

hypothesis is selected by Williams’ method. ■ 

As basically illustrated in Example [I] FKLD has a tendency 
towards the merging operation no matter how separated the 
components of the original mixture are. Similarly Runnalls’ 
method considers only the merging operations ruling out the 
pruning hypotheses from the start. The significant preference 
towards the merging operation tends to produce reduced 
mixtures which may have significant support over the regions 
where the original mixture have negligible probability mass. 
This is called as the zero-avoiding behavior of the KLD in the 
literature 0 p. 470], Such a tendency may be preferable 
in some applications such as minimum mean square error 
(MMSE) based estimation. On the other hand, it may also 
lead to a loss of the important details of the original mixture, 
e.g., the mode, which might be less desirable in applications 
such as maximum a posteriori (MAP) estimation. In such 
applications, having the pruning operation as a preferable 
alternative might preserve significant information while at the 
same time keeping a reasonable level of overlap between the 
supports of the original and the reduced mixtures. 

Example [2] illustrated that a similar tendency toward merg¬ 
ing (when the components are far from each other) can appear 
in the case of Williams’ method for some specific cases 
(weights being equal). It must be mentioned here that, in 
Example [2] if the weights of components were not the same, 
Williams’ method would select to prune the component with 
the smaller weight. Therefore, the tendency toward merging 
(when the components are far from each other) is significantly 
less in Williams’ method than in FKLD and Runnalls’ method. 
It is also important to mention that, in some target tracking 
algorithms such as MHT and GM-PHD filters, mixtures with 
some components with identical weights are commonly en¬ 
countered. 

Williams’ method, being a global greedy approach to MR, 
is computationally quite expensive for mixture densities with 
many components. The computational burden results from 
the following facts. Reducing a mixture with N components 
to a mixture with N — 1 components involves 0(N 2 ) hy¬ 
potheses. Since computational load of calculating the ISE 
between mixtures of N and N — 1 components is 0(N 2 ), 
reducing a mixture with N components to a mixture with 
TV — 1 components has the computation complexity 0(N A ) 
with Williams’ method. On the other hand, using the upper 
bound 0. Runnalls’ method avoids the computations associ¬ 
ated with the components which are not directly involved in 
the merging operation resulting in just 0(N 2 ) computations 
for the same reduction. Another disadvantage of Williams’ 
method is that the ISE does not scale up with the dimension 
nicely, as pointed out in an example in [12). 
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IV. Proposed Method 

We here propose a greedy global MR algorithm based on 
KLD which can give credit to pruning operations and avoid 
merging (unlike Runnalls’ method) when components of the 
mixture are far away from each other. The MR method we 
propose minimizes the KLD from the reduced mixture to the 
original mixture, i.e., RKLD. Hence we solve the following 
optimization problem. 

I*,J* = arg o min N D KL (p(x\Hij)\\p(x)). (25) 
1 <J<N 


By using the RKLD as the cost function, we aim to enable 
pruning and avoid the ever merging behavior of Runnalls’ 
method unless it is found necessary. We illustrate the char¬ 
acteristics of this cost function in MR with the following 
examples. 

Example 3. We consider the same MR problem in Example [T| 
and [2] when p > 0 is very small. When p is sufficiently close 
to zero, we can express Dkl(p{x\Hij)\\p(x)) using a second- 
order Taylor series approximation around p = 0 as follows: 


where 


Dkl{p(x\Hu)\\p(x)) ~ \ c iJV 2 


° IJ ~ 


f-L—O 


(26) 


(27) 


This is because, when p = 0, both p(-) and p(-\Hu) are 
equal to 7V(-;0,1) and therefore Dkl(p{x\Hij)\\p(x)) = 0. 
Since D k l{p{x\Hi j)\\p(x)) is minimized at p = 0, the first 
derivative of Dkl{p{x\Hij)\\p{x )) also vanishes at p = 0. 
The second derivative term Cjj is given by tedious but 
straightforward calculations as 


c/j = 


SjlK*) ^ -LPWij 


fi—0 




fi—0 


Af(x\ 0,1) 


dx. (28) 


Using the identity 

d 


—7V(x; p, 1) = (x - p)U{x-, p, 1) 


(29) 


we can now calculate Coi, C 02 and C 12 as 
c 0 i =4 w{ 
c 02 =4 u: 2 , 

C 12 =0. 


(30a) 

(30b) 

(30c) 


Hence, when p is sufficiently small, the RKLD cost function 
results in the selection of merging operation similar to FKLD 
and Runnalls’ method. ■ 


Example 4. We consider the same MR problem in Example |T] 
and [2] when p is very large. In the following, we calculate the 
RKLDs for the hypotheses given in ( fTO} . 

• Hoi and Ho 2 - We can write the RKLD for Hoi as 

Dkl{p(-\H 0 i)\\p(-)) = - E Hoi \\ogp(x)] - H(p(-\H 0 i)), 

(31) 

where the notation Eu I j [■] denotes the expectation oper¬ 
ation on the argument with respect to the density function 


p(-\Hu) and H(-) denotes the entropy of the argument 
density. Under the assumption that p is very large, we 
can approximate the expectation Ep_ 01 [logp(x)] as 

En 01 [logp(x)] 

= J Af(x;p, 1) 

x log (wi—p, 1) + w 2 J\f(x; p, 1)) dx (32) 
« J A f(x; p , 1) log (tn 2 A f{x\p, 1)) dx (33) 

= logu ;2 - log \J2-k - i (34) 

where we used the fact that over the effective integration 
range (around the mean p), we have 

— p, 1) ~ 0. (35) 

Substituting this result and the entropjQ of p(-\Hoi) 
into along with, we obtain 

Dkl(p(-\H 0 i)\\p{-)) ~ -\ogw 2 . (36) 

Using similar arguments, we can easily obtain 

Dkl{p(-\Ho 2 )\\p{-)) ~ - \ogwi. (37) 


Noting that the logarithm function is monotonic, we can 
also see that the approximations given on the right hand 
sides of the above equations are upper bounds for the 
corresponding RKLDs. 

• %i 2 - We now calculate a lower bound on 
Dkl(p(x\Hi 2 )\\p{x)) and show that this lower 
bound is greater than both — log Wi and — log w 2 when 
p is sufficiently large. 


Dkl(p(x\Hi 2 )\\p{x)) = —E Hl2 [logp(x)] -H(p(-\Hi 2 )) 

(38) 

We consider the following facts 

AC(x; —/x, 1) > A f{x\p, 1) when x < 0, (39a) 
A f(x; p , 1) > A/"(x; — p, 1) when x > 0. (39b) 

Using the identities in ( |39| ) we can obtain a bound on the 
expectation E Ul2 [logp(xj] as 


EH 12 \^ogp{x)] 

=E U 12 [log (wiN{x\ -p, 1) + w 2 J\f(x; p, 1))] 

< / AC(x; p, E) log (A/"(x; — p, 1)) dx 
J x<0 

+ j Af(x; p, E) log (A/"(x; /x, 1)) dx (40) 
J x>0 

= - [ A f(x-,p,E)(\ogV2^+ {X + fir 

Jx< 0 V 


2 

(x - p) 2 


dx 


dx 

(41) 


- [ AC(x; p , E) ( log V2n + 

Jx> 0 V z 

<-iogV2i- I Af(x;P,S)L±^£±L dl 

Jx< o 2 

(42) 




/ x>0 


1 Entropy of a Gaussian density A/*(-, fi, cr 2 ) is equal to log V2nea 2 . 
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— log ypht 

p 2 + 

S + p 2 
2 

- P 

f X 

’Af(x; p, 

E) dx 


/ x<0 



+ P 

f * 

’Af(x; p, 

E) dx 


/ tc>0 



— log ypht 

P 2 + 

E + p 2 
2 

- P 

/ ' x 

’Af(x; p, 

E) dx 


(43) 


+ p^p — J xAf{x\ p, E)^ da; (44) 

1 /o p 2 + T> + p 2 
= -logV27T---h PP 

— 2/a / xAf(x; p , E) dx 

J :r<0 


= — log y/2n — 


(p~ PF + T, 


where we have used the result 

[X 

I xAf(x, /Li, E) dx 


(45) 


(46) 


= p4> 


X — /Li 


(47) 


Ve y V \/s 

Here, the functions </>(•) and <1>(■) denote the probability 
density function and cumulative distribution function of 
a Gaussian random variable with zero mean and unity 
variance, respectively. Substituting the upper bound ( |46| ) 
we obtain 


into 

D KL (p(x\Hi 2 )\\p(x)) 

^ 4 l0gS4 


(p - p) 2 + E 




(48) 


~ ^ ^ ^ log (4wi w 2 p 2 ) + 2p 2 fw-L + ( w 2 - wi) 

Jf, f Wi - W 2 \ n -, ( Wi -W 2 \ 

= - - - - log ( Aw x w 2 p 2 ) + 2g(w 1 ,w 2 )p 2 


(49) 


for sufficiently large p values where we used the defini¬ 
tions of p, E and the fact that when p tends to infinity, 
we have 


P 


w i — w 2 


a/e y/4wiW2 

The coefficient g(w\. w 2 ) in ( |49| is defined as 

g(wi,w 2 ) =wi + (w 2 - wi)$ ( U l U2 ) 

v v4rciW2/ 


(50) 


— \ZAw\w 2 (j) 


wi-w 2 \ 

\/4wiw 2 J 


(51) 


When p tends to infinity, the dominating term on the right hand 
side of (|49|> becomes the last term. By generating a simple 


plot, we can see that the function g{w i,l — w{) is positive 
for 0 < w± < 1, which makes the right hand side of (|49| go 
to infinity as p tends to infinity. Consequently, the cost of the 
merging operation exceeds both of the pruning hypotheses for 
sufficiently large p values. Therefore, the component having 
the minimum weight is pruned when the components are 
sufficiently separated. ■ 


As illustrated in Example |4j when the components of the 
mixture are far away from each other, RKLD refrains from 
merging them. This property of RKLD is known as zero¬ 
forcing in the literature 114 p. 470]. 

The RKLD between two Gaussian mixtures is analytically 
intractable except for trivial cases. Therefore, to be able 
to use RKLD in MR, approximations are necessary just as 
in the case of FKLD with Runnalls’ method. We propose 
such approximations of RKLD for the pruning and merging 
operations in the following section. 


V. Approximations for RKLD 

In sections V-A and V-B| specifically tailored approxima¬ 
tions for the cost functions of pruning and merging hypotheses 
are derived respectively. Before proceeding further, we would 
like to introduce a lemma which is used in the derivations. 


Lemma 1. Let qi(x), qj(x), and Pk{x) be three probability 
distributions and wj and Wj two positive real numbers. The 
following inequality holds 


q K {x) log 


Qk(x) 


dx < 


wiqi{ x) + wjqj(x) 

- log (wi exp(-D KL (q K \\qi)) + wj exp(-D KL (q K \\qj))) 

(52) 


Proof For the proof see Appendix [A] 


□ 


A. Approximations for pruning hypotheses 
Consider the mixture density p(-) defined as 

N 

p ( x ) = ^2,w I q I {x), ( 53 ) 

7=1 

where qi(x) = q(x;rji). Suppose we reduce the mixture in 
© by pruning the 7th component as follows. 

p(x\H 0 i) = ——— Y Wiq-i{x). (54) 

1 — Wr 

ie{ l...jv}—{/} 

An upper bound can be obtained using the fact that the 
logarithm function is monotonically increasing as 

Dkl{p{-\'H 0 i)\\p{-)) = /p(x|77o/)log ^ J ^ OJ ) dx 
J P{x) 

[ -v ln/ a i p(x\H Q i) ^ 

= / p{xH 0 i)\og- - , -dx 

J (1 - wi)p{x\Hqi) + wiqi 

< - log(l - Wj) + f p(x\Hoi) log dx 

J p{x\Hoi) 

= -log(l -wi). (55) 

This upper bound is rather crude when the /th com¬ 
ponent density is close to other component densities in 
the mixture. Therefore we compute a tighter bound on 
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Dkl(p(-\'Hoi)\\p(-)) using the log-sum inequality fTT). Be¬ 
fore we derive the upper bound, we first define the following 
unnormalized density 

r(x) = ^2 w i<li{x) (56) 

where J I and f r(x ) da; = 1 — wj — wj. 

We can rewrite the RKLD between p(-\7ioi) and p (-) as 



Dkl(p(-\'Hoi)\\p(-)) = 


1 — Wi 


(r(x) +wjqj( x)) 


x log ■ 


(r(x) +wjqj(x)) 


1 — Wj 


r(x) + wiqi(x) + wjqj(x) 
= ~ log(l - U>/) + 


da; 


x log 


1 — Wi 

r(x) + wjqj( x) 


r(x) + wiqi(x) + wjqj(x) 


(r(x) + wjqj(x)) 

dx. (57) 


Using the log-sum inequality we can obtain the following 
expression. 


Dkl{p{-\T-Loi)\\p{-)) 

< - log(l - Wi) + ——— / r(x) 

1 ~wj ) 


log r M dr 


-(a;) 


+ 


1 — wi 


wjqj{x) log 


wjqj{a 


wiqi(x) + Wjqj(x) 


da; 


= - log(l -Wi) + — 
Wj 


Wj 


wi 


+ 


1 — wi 


qj(x) log 


log (wj) 

Qj( x ) 


da;. (58) 


wiqi(x ) + wjqjix) 

Applying the result of Lemma [1] on the integral, we can write 

D K l(p{-\‘Hoi)\\p(-)) < -log(l-tuj) 

log (l + — exp(-D i< - L (q , j||gr / )) N ) . (59) 

\ wj J 


Wj 


1 — wi 


Since J is arbitrary, as long as ./ 7 ^ /, we obtain the upper 
bound given below. 


Dkl{p{-\Hqi)\\p{-)) < 


log(l - Wi) 


log ( 1 

1 — Wj 


Wj 


+ — exp(-D KL (qj\\qi)) 

\ wj 


(60) 


The proposed approximate divergence for pruning component 
I will be denoted by i?(0,1), where 1 < I < N in the rest 
of this paper. In the following, we illustrate the advantage of 
the proposed approximation in a numerical example. 

Example 5. Consider Example [T] with the mixture © 
and the hypothesis ( |10b| >. In Figure |T| the exact divergence 
Dkl(p('\Ho 2 )\\p(’))> which is computed numerically, its 
crude approximation given in ( |55[ i and the proposed approx¬ 
imation R( 0,2) are shown for different values of // with 
wi = 0.8. Both the exact divergence and the upper bound 
R( 0, 2) converge to — log(l— W 2 ) when the pruned component 
has small overlapping probability mass with the other com¬ 
ponent. The bound R( 0, 2) brings a significant improvement 
over the crude bound when the amount of overlap between the 
components increases. ■ 


Figure 1 : The exact divergence Dkl(p('\Ro 2 )\\p(')) (solid 
black), the crude upper bound — log(l — wi ) (dotted black) 
and , the proposed upper bound R( 0, 2) (dashed black) are 
plotted for different values of //. 


B. Approximations for merging hypotheses 

Consider the problem of merging the /th and the Jth 
components of the mixture density ( |53| where the resulting 
approximate density is given as follows. 

p{x\Hu) =wuqu{x) + ^2 Wiqi(x). (61) 


We are interested in the RKLD between p(-\Ru) and p(-). 
Two approximations of this quantity with different accuracy 
and computational cost are given in sections | V-B1 1 and |V-B2| 
1) A simple upper bound: We can compute a bound on 
D K l{p{-\Rij)\\p(-)) as follows: 


Dkl(p(-\Rij)M-)) 


r(x) + wijqu(x) 


= / M*) + ».®W) log r(x) + wm(x) + wjqj[x) 

< [ r{x) log ^ dx 
J r(x) 

+ / wijqu(x) log 


da; 


wijqujx) 
wiqi(x) + Wjqj[x) 


=wu log w u + wij J qu{x) log 


qu(x) 


wiqi(x) + wjqj(x) 


da; 

(62) 


where the log-sum inequality is used and r(a;) is defined 
in ( |56| ). Using Lemma |T| for the second term on the right 
hand side of ( |62) i, we obtain 

Dkl{p(-\Rij)\\p{-)) < wij log wij - wijx 

log (wi exp(■- D kl (qi j 11 qi )) + w j exp (■-D KL (qi j \ \ qj ))). 

(63) 


2) A variational approximation: The upper bound derived 
in the previous subsection is rather loose particularly when 
the components qi and qj are far from each other. This is 
because of the replacement of the second term in ( [62] ) with its 
approximation in ( |63| which is a function of DKL{qu\\qi) 
and D KL (qu\\qj). 

For a brief description of the problem consider 

D^quWwrqi+wjqj) = 

[ qu(x) log- -—da; (64) 

J wiqi{x) +wjqj{x) 

and its approximation using Lemma [T] given as 
- log (wi exp(-D KL (qu\\qi)) + wj exp(-D KL (qu\\qj))). 
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The integral involved in the calculation of Dkl{<1ij\\<1i) 
grows too much (compared to the original divergence 
D(qu\\wiqi + wjqj )) over the support of qj because qjj 
shares significant support with qj. Similarly, the integral in 
DKL(qu\\qj) grows too much (compared to the original di¬ 
vergence D(qjj\\u>iqi+wjqj)) over the support of qi because 
qu shares significant support with qj. To fix these prob¬ 
lems, we propose the following procedure for approximating 
D(qj j | \wjqi+wjqj) using the boundedness of the component 
densities. In the proposed variational approximation, first, an 
upper bound on D(qu\\wiqi +wjqj ) is obtained using the 
log-sum inequality as 

D(qu\\wiqi + wjqj) < / aq u log —^ dx 

J Wjqi 

+ / (1 - a)qu log —— °^ qU dx (65) 

J wjqj 

=a / qu log < ^~ dx + a log — 

J qi wi 

+ (1 - a) [ qu log — dx + (1 - a) log -—- (66) 

J qj wj 

where 0 < a < 1. Second, a switching function is applied to 
the integrals in (|66]> as follows. 



Figure 2: The exact divergence D k l(p('\Ri 2 )\ \p(')) (solid 
blue), the simple approximation given in ( |63] > (dotted blue), 
and the proposed approximation f?(l, 2 ) (dashed blue) are 
plotted for different values of /i. In addition, the exact pruning 
divergence D ^ l(p('\Rq 2 )\\p(')) (solid black), the crude prun¬ 
ing upper bound — log(l — 1 / 72 ) (dotted black) and the proposed 
pruning upper bound R( 0, 2) (dashed black) are plotted. 

When we substitute this approximation into (62} , the proposed 
approximate divergence for merging becomes 

D KL (p(-\TLu)\\p(-)) ~ wij log wij - w u x 

log (wiexp(-V(q IJ ,q Jl qi)) +wjexp(-V(q I j,q I ,qj))) . 

(70) 


D(qu\\wiqi +Wjq.j) 

” / aqij(x) (1 -fp) log + “ log Hr, 
+ / <* - <*>«»« (l - |M) 1°8 <** 

+ (1 - a) log -—- (67) 

Wj 

where q^ ax = max^ qi(x). Here, the multipliers ^1 — ; 

and ^1 — 2 ^ in the integrands are equal to almost unity 

everywhere except for the support of qj and qj respectively, 
where they shrink to zero. In this way, the excessive growth 
of the divergences in the upper bound are reduced. Note also 
that the terms qy- ax and q r 7 nax are readily available as the 
maximum values of the Gaussian component densities. Finally, 
the minimum of the right hand side of ( |67j ) is obtained with 
respect to a. The optimization problem can be solved by taking 
the derivative with respect to a and equating the result to zero 
which gives the following optimal value 

* wiexp(-V{q IJl q j7 q I )) 

Oi = - 

w/ exp(-F (qu, qj , qi)) + wj exp(-H (q u , qi,qj))) ’ 

where 

V(q K ,qi,qj) = J qn(x) (l - log dx. 

( 68 ) 

A closed form expression for the quantity V(qK,qi,qj) is 
given in Appendix [B] Substituting a* into ( |67| results in the 
following approximation for D(qjj\\wiqi + wjqj). 

D(q IJ \\wiq I + wjqj) « 

- log (wi exp(—V (qu, qj, qi)) + Wj exp(-H (q u , qi, qj ))) 

(69) 


The proposed approximate divergence for merging compo¬ 
nents I and J in (70} will be denoted by R(I , J) in the rest 
of this paper. 


Example 6. Consider Example |T] with the mixture © and 
the hypothesis (10c} . In Figure [2] the exact divergence 
Dkl(p('\Ri 2 )\ |p(-))> which is computed numerically, the 
simple approximation given in (63} and the proposed ap¬ 
proximation R( 1,2) are shown for different values of /i with 
wi = 0 . 8 . 

Note that the range at which approximations are interesting 
to study is where D KL (p(-\H 12 )\\p(-)) < -log(l - w 2 ). 
This is because of the fact that when Dkl(p(-\Rv 2 )\\p(')) > 
— log(l — w 2 ), pruning hypothesis R 02 will be selected. 
The range of axes in Figure [2] is chosen according to the 
aforementioned range of interest. 

The proposed variational approximation f?(l,2) performs 
very well compared to the simple upper bound which grows 
too fast with increasing // values. Furthermore, the points 
where the curves of exact pruning and merging divergences 
cross (the crossing point of the solid lines in Figure [2} is close 
to the crossing point of the curves of best approximation of 
these two quantities (the crossing point of the dashed lines in 
Figure |2}. ■ 


The approximate MRA proposed in this work uses the two 
approximations for pruning and merging divergences, R(I, J) 
where 0<I<N,1<J<N, and / 7 ^ J. We will refer 
to this algorithm as Approximate Reverse Kullback-Leibler 
(ARKL) MRA. The precision of ARKL can be traded for less 
computation time by replacing the approximations for pruning 
and merging hypotheses by their less precise counterparts 
given in sections V-A and |V-B 


VI. Simulation Results 

In this section, the performance of the proposed MRA, 
ARKL, is compared with Runnalls’ and Williams’ algorithms 
in two examples. In the first example, we use the real 
world data from terrain-referenced navigation. In the second 
























Figure 3: The data for the original mixture density is courtesy of Andrew R. Runnalls and shows the horizontal position errors 
in the terrain-referenced navigation. The unit of axes is in kilometers and are represented in UK National Grid coordinates. 
The original mixture in 3a is reduced in 3b 3c and 3d using different MRAs. 



(a) Original density (b) 6 clusters by EM (c) 15 clusters by EM (d) Runnalls’ (e) Williams' (f) ARKL 

Figure 4: Results of the clustering example. In the top row, the contour plots of the GMs are given. The bottom row presents 
the corresponding data points associated to the clusters in different colors. 


example, we evaluate the MRAs on a clustering problem with 
simulated data. 


A. Example with real world data 

An example that Runnalls used in 112 Sec.VIII] is re¬ 
peated here for the comparison of ARKL with Runnalls’ and 
Williams’ algorithms. In this example a Gaussian Mixture 
(GM) consisting of 16 components which represents a terrain- 
referenced navigation system’s state estimate is reduced to a 
GM with 4 components. The random variable whose distribu¬ 
tion is represented by a GM is 15 dimensional but only two 
elements of it (horizontal components of position error) are 
visualized in 112 Sec.VIII]. We use the marginal distribution 


of the visualized random variables to illustrate the differences 
of MRAs. 

Figure [3a] shows the original mixture density. Each Gaussian 
component is represented by its mean, its weight, which is 
proportional to the area of the marker representing the mean, 
and its 50 percentile contour (an elliptical contour enclosing 
the 50% probability mass of the Gaussian density). The thick 
gray curves in Figure [3] are the 95 percentile contour and 
the thick black curves show the 50 percentile contour of the 
mixture density. 


In Figures 3b and 3c the reduced GMs using Runnalls’ 


and Williams’ methods are illustrated by their components and 


their 50 and 95 percentile contour^] Figure 3d illustrates the 
result of the ARKL algorithm. 

Williams’ algorithm is the only algorithm that preserves the 
second mode in the 50 percentile contour, but has distorted 
the 95 percentile contour the most. The 50 percentile contour 
of ARKL is the most similar to the original density. Also, the 
shape of the 95 percentile contour is well preserved by ARKL 
as well as Runnalls’. An interesting observation is that the 95 
percentile contour for ARKL has sharper corners than that 
obtained by Runnalls’ algorithm which is a manifestation of 
the pruning decisions (zero-forcing) in ARKL as opposed to 
the merging (zero-avoiding) decisions in Runnalls’ algorithm 
for small-weight components. 

The MRAs are compared quantitatively in terms of di¬ 
vergences, FKLD, RKLD and ISE in Table | The FKLD 
and RKLD are calculated using MC integration while ISE 
is calculated using its analytical expression. Naturally, MRAs 
obtain the least divergence with their respective criteria. That 
is, Runnalls’ algorithm obtains the lowest FKLD, Williams’ 
algorithm obtains the lowest ISE, and ARKL obtains the 
lowest RKLD. In terms of ISE, Runnalls’ algorithm and 


2 The difference between Figures [3b| and|3c|and 1 12; Fig.2] is because of the 
exclusion of the pruning hypotheses in implementation of Williams' algorithm 
in on Fig.2] and the fact that we are only using the marginal distribution of 
the two position error elements in the random variable. 
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Table I: Comparison of MRAs in terms of the divergence 
between the reduced mixture density and the original density. 


Table II: The parameters of the original mixture density used 


in the example of section VI-B and presented in Figure 4a 



Mixture Reduction Algorithms 


Divergence Measures 

Williams’ 

Runnalls’ 

ARKL 

3 

ISE 

0.0175 

0.0255 

0.0271 

1 

FKLD 

0.1177 

0.0665 

0.3024 

RKLD 

0.1325 

0.1234 

0.0340 



2 


ARKL seems to be very close to each other which implies 3 

that for large weight components (large probability mass), - 

these algorithms work similarly. In terms of FKLD, the 
Williams’ algorithm obtains much closer a value to that of 
Runnalls’ algorithm than ARKL which means that the merging 
(zero-avoiding) properties of Williams’ algorithm is closer to • 
Runnalls’ algorithm. Finally, in terms of RKLD, Williams’ 
algorithm and Runnalls’ algorithm have similar and larger • 
values than ARKL. This makes us conclude that pruning (zero¬ 
forcing) ability of Williams’ algorithm is as bad as Runnalls’ 
algorithm for this example. 


1 < j < 3 


4 < j < 6 


0.2 


0.2 


0.2 




' 4 ' 


2 O' 

-4 


0 1 


6 0.1 


fij 


' 1 

0.5 

O O 

,03 03 

4 

0.2 

■ i r 

' 1 

0.2 

o o 

,03 to 

5 

0.1 

'-7 

0 


' 2 - 2 ' 
-2 3 

0.1 O' 
0 3 

'0.1 O' 
0 3 


Maximization (EM) algorithm [ [14] Chapter 9], 

The clustering of the corrupted data into 15 clusters using 
EM algorithm. 

The clustering of the corrupted data into 15 clusters using 
EM algorithm and then reduction to 6 clusters using 


- Runnalls’ algorithm 

- Williams’ algorithm 


- ARKL 


B. Robust clustering 

In the following, we illustrate the necessity of the pruning 
capability of an MRA (as well as merging) using a clustering 
example in an environment with spurious measurements. 

Consider a clustering problem where the number of clusters 
N c is known a priori. Also, assume that the data is corrupted 
by a small number of spurious measurements, which are the 
data points not belonging to any of the true clusters and can 
be considered to be outliers. The clustering task is to retrieve 
N c clusters from the corrupted data. 

A straightforward approach to this problem is to neglect 
the existence of the spurious measurements and cluster the 
corrupted data directly into N c clusters. Since this solution 
does not take the spurious measurements into account, the re¬ 
sulting clusters might not represent the true data satisfactorily. 
Another solution to the problem is to first cluster the data 
points into N > N c clusters and then reduce the number 
of clusters to N c . When GMs are used for clustering, an 
appropriate MRA can be used to reduce the number of clusters 
(component densities) from N to N c . An MRA which can 
prune components with small weights which are distant from 
the rest of the probability mass is a good candidate. On the 
other hand, an MRA which reduces the mixture using only 
merging operations is not preferable because the spurious 
measurements would corrupt the true clusters. Furthermore, 
when a component is being pruned/discarded the data points 
associated to the pruned component density can also be 
discarded as potential spurious data points. Consequently, it is 
expected that ARKL and Williams’ algorithm might perform 
better than Runnalls’ algorithm in this example because of 
their pruning capability. 

In this simulation n = 1000 data points are generated from 
a Gaussian mixture with N c = 6 components, {xi}" =1 ~ 
p(x) = [Cjii WjAf(x: Hj, T,j) where the parameters of p(x) 
are given in Table [II] The data points are augmented with 
to = 100 spurious data points sampled uniformly from a 
multivariate uniform distribution over the square region A, 
i.e., {y,;}™ j ~ U{A), where the center of the square is at the 
origin and each side is 20 units long. The clustering results of 
the following methods are presented. 

• The direct clustering of the corrupted data, i.e., z = 
{{ x i}i=ii {Ui}rLi} into 6 clusters using Expectation 


In Figure [4] the contour plots of the densities along with the 
corresponding data points are illustrated. In Figure [4a] (top) 
the contour plots of the true clusters are plotted. The true 
(black) and spurious (green) data points are shown in Figure [4a] 
(bottom). 

In Figure [4b] the contour plot (top) of the GM obtained 
by EM algorithm clustering the corrupted data directly into 6 
clusters is shown with the corresponding cluster data points 
(bottom). The direct clustering of the corrupted data into 6 
components results in a clustering which does not capture 
all of the true clusters since a spurious cluster with a large 
covariance is formed out of spurious data points (red data 
points). 

As a remedy, the data is first clustered into 15 clusters 
(see Figure [4c]) using EM algorithm and the resulting 15- 
component GM is then reduced to 6 components using Run¬ 
nalls’ algorithm (Figure |4d| , Williams’ algorithm (Figure [4e]) 
and ARKL (Figure [4f|i. With these algorithms, when a com¬ 
ponent is pruned (which might be the case for Williams’ 
algorithm and ARKL), the data points associated with that 
component are discarded as well. On the other hand, when 
a component is merged with another component, data points 
associated with the merged component are reassigned to the 
updated set of clusters according to their proximity measured 
by Mahalanobis distance. 

The final clustering obtained by Runnalls’ algorithm is 
almost the same as the result obtained by the EM algorithm 
with 6 clusters. This observation is in line with the Maximum 
Likelihood interpretation/justification of the Runnalls’ algo¬ 
rithm given in fl2) . Williams’ algorithm and ARKL capture 
the shape of the original GM and the 6 clusters much better 
than Runnalls’ algorithm because these two algorithms can 
prune (as well as merge) while Runnalls’ algorithm can only 
merge. While some of the spurious clusters/measurements are 
merged with the true clusters by Williams’ algorithm, almost 
all spurious clusters/measurements are pruned by ARKL. 
Lastly, as aforementioned, the computation complexity of 
Williams’ method is 0(N i ) while reducing a mixture with N 
components to a mixture with N — 1 components. However, 
Runnalls’ and the ARKL methods have 0(N 2 ) complexity, 
which makes them preferable when a fast reduction algorithm 
is needed for reducing a mixture with many components. 
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VII. Conclusion 


of (|52| as 


We investigated using the reverse Kullback-Leibler diver¬ 
gence as the cost function for greedy mixture reduction. Pro¬ 
posed analytical approximations for the cost function results 
in an efficient algorithm which is capable of both merging 
and pruning components in a mixture. This property can help 
preserving the peaks of the original mixture after the reduction 
and it is missing in FKLD based methods. We compared the 
method with two well-known mixture reduction algorithms 
and illustrated its advantages both in simulated and real data 
scenarios. 
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Appendix 


A. Proof of Lemma [7] 

In the proof of Lemma [I] we use the same approach as the 
one used in 118 Sec. 8]. Let 0 < a < 1. Using the log-sum 
inequality we can obtain an upper bound on the left hand side 


[ qK log-^-dx 

J wiqi + wjqj 

= f ( aq K + (1 - o)q K ) log QgA + ^ ^ dx 
J wiqi + wjqj 

< f aqK log a<lK dx + [ (1 - a)gif log —— °^ gA dx 
J wiqi J wjqj 

= a qx log — da: + a log — 

J qi wi 

+ (1 - a) [ qx log — da; + (1 — a) log -—— 

J qj wj 

= aD K L(qK\\qi) + alog-b (1 - a)D K L{qK\\qj) 

Wj 

+ (1 — a ) log “—-• (71) 

Wj 

The upper bound can be minimized with respect to a to obtain 
the best upper bound. The minimum is obtained by taking the 
derivative of ( fTT| with respect to a; equating the result to zero 
and solving for a, which gives 

• Wiexp{-D KL (q K \\q I )) 

a = -—. 

Wi exp(-D K L(qK\\qi)) + Wjexp(-DxL{qK\\qj)) 

Substituting a* into (J7TJ gives the upper bound in ( |52| ). 

B. Derivation of V(qx,qi,qj) 

Let q K {x) = A f(x;/JK,^x), qi{x) = J\f(x; hi, £/) 
and qj(x) = Af(x; \ij, Ej) and x £ R fc . The integral 
V{qx,qi,qj ) can be written as 

V{q K ,qi,qj) = J q K (f ~ log yy dx 
=D K L{qK\\qj)- J qxqi log ^y-dx 

=DxL{qK\\qj) - (27r) fc / 2 |£ / | 1 / 2 A/’(/ij; Hk, + Ej)x 

i^J AF(x\ /I *, E*) log q K dx - J M{x\ n*, £*) log qj dx^j 

—M 

- 0.5 Tr P7 1 (S J -E k - {pj - px)(pj ~ Pk) T )\ 

- (2 7 r) fe / 2 |E / | 1 / 2 A f(tu; tix, Z K +£/)[- 0.51og \2ttZk\ 

- 0.5 Tr [£^(£* + (jm k - ^)(ji K - M *) T )] 

+ 0.5log |27 tEj| 

+ 0.5 Tr [Ej 1 (E* + ( MJ - f)(»j - //) T )]]. 

where we use the following well-known results for Gaussian 
distributions. 

• Divergence 

D K L(^(x-,tiK^K)\\^(x;iij,T,j)) = 0.5log 

IM 

- 0.5 Tr [EJ^Ej - E* - (fij - ~ Pk) T )\- 

(72) 

• Multiplication 

W(x; (ii, E/)7V(x; Hk, %k) = Hi, £/ + ^k) 

x AA(x;/r*,E*) (73) 
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where 

A 4 * = A 4 / + £/(£/ + VkT\hk ~ A 4 /), (74) 

E* = E 7 -E j (E / + E a: )- 1 E / . (75) 

• Expected Logarithm 

jAf(x; hi, £/) logA/ r (a;; /tj, Ej) dx = -0.5 log |27 tEj| 

- 0.5TV [E7 1 (E / + (a u - A h){^j - A 4 /) T )]- (76) 

• Maximum Value 

q^ ax = (2 7 r)- fe/2 |E / r 1/2 . (77) 



